Progress Memo 2

Final Project
Data Science 1 with R (STAT 301-1)

Author

Chelsea Nelson

Published

November 23, 2023

GitHub Repo Link

Progress Summary

Data Wrangling

Before using my data, I first decided how to handle the missingness in my dataset. After further investigation, I realized that all of the missing values for my variables corresponded to one specific county and its multiple family cases. I therefore decided to remove that county's observations from the dataset entirely, as I felt that leaving them in would cause more problems for my analysis than taking them out. The table below shows the variables that previously had missing values, confirming that the missingness in my data is gone. I can now move forward without having to work around missing values in the variables I want to use.

variable n_miss pct_miss
median_family_income 0 0
num_counties_in_st 0 0
st_cost_rank 0 0
st_med_aff_rank 0 0
st_income_rank 0 0
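As a minimal sketch of the removal and the follow-up check described above, the steps could look like the following in base R. The toy data frame, the `county_name` column, and the object names are assumptions made for illustration and are not taken from the project's code; only `median_family_income` and `st_cost_rank` come from the memo.

```r
# Toy data standing in for the real dataset (values are made up).
# `county_name` is a hypothetical identifier column.
counties <- data.frame(
  county_name = c("A", "A", "B", "B", "C"),
  median_family_income = c(52000, 52000, NA, NA, 61000),
  st_cost_rank = c(10, 10, NA, NA, 3)
)

# Find the counties that have a missing value in any variable.
has_missing <- !complete.cases(counties)
counties_with_na <- unique(counties$county_name[has_missing])

# Drop every observation that belongs to those counties.
counties_clean <- counties[!counties$county_name %in% counties_with_na, ]

# Summarise missingness per variable, mirroring the columns of the table above.
miss_summary <- data.frame(
  variable = names(counties_clean),
  n_miss = unname(colSums(is.na(counties_clean))),
  pct_miss = unname(100 * colMeans(is.na(counties_clean)))
)
print(miss_summary)
```

Removing by county rather than by row keeps the remaining counties complete, at the cost of losing every family case for the affected county.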

After deciding how to handle missingness in my dataset, I moved on to adding variables that I feel will open up new questions and associations in my analysis. One of the added variables gives the minimum wage in each state for 2022 (minimum_wage) 1, while the other gives the geographical region of each state (region). 2 Additionally, I changed the type of the metro variable to a factor, with 0 meaning the county is located in a nonmetropolitan area and 1 meaning the county is located in a metropolitan area. Below I have provided the updated version of my dataset, including both of my new variables. I still plan to add the racial majority makeup of each county, but I am still looking for a way to bring in this information without too much hassle. For now, I have continued my analysis without it, but I have left space for it and plan to include it in my final project.
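One way the two state-level variables could be attached and the metro variable converted is sketched below in base R. The state codes, the lookup table `state_info`, all values, and the factor labels are placeholders assumed for illustration; they are not real data and not the project's actual code.

```r
# Toy county-level data; "AA" and "BB" are placeholder state codes.
counties <- data.frame(
  state = c("AA", "AA", "BB"),
  metro = c(1, 0, 1)
)

# Hypothetical state-level lookup table with placeholder values.
state_info <- data.frame(
  state = c("AA", "BB"),
  minimum_wage = c(10, 8),
  region = c("Midwest", "South")
)

# Left join: keep every county row and attach its state's values.
counties_updated <- merge(counties, state_info, by = "state", all.x = TRUE)

# Convert metro to a factor: 0 = nonmetropolitan, 1 = metropolitan.
# The text labels are an assumption; the levels could also stay as 0 and 1.
counties_updated$metro <- factor(
  counties_updated$metro,
  levels = c(0, 1),
  labels = c("nonmetro", "metro")
)

str(counties_updated)
```

Using `all.x = TRUE` means a county whose state is absent from the lookup table is kept with `NA` in the new columns instead of being silently dropped.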

Current EDA

Univariate Analysis

For my univariate analysis, I looked at both the categorical and numerical variables, and found the most interesting statistics and figures among the numerical variables.

However, before looking into my numerical variables, I believe it is important to highlight the difference between the number of nonmetro and metro areas in the dataset, to gauge whether these geographical differences will have any impact on how I view and analyze my findings in the future.
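A minimal sketch of how this comparison could be computed and plotted in base R follows. The `metro` vector holds made-up values, and the labels and plot titles are assumptions made for illustration.

```r
# Made-up metro indicator: 0 = nonmetropolitan, 1 = metropolitan.
metro <- factor(
  c(0, 1, 1, 0, 0),
  levels = c(0, 1),
  labels = c("nonmetro", "metro")
)

# Count of counties in each category and the share each one represents.
metro_counts <- table(metro)
metro_props <- prop.table(metro_counts)

print(metro_counts)
print(round(metro_props, 2))

# Simple bar chart of the counts.
barplot(
  metro_counts,
  main = "Metro Status of Counties",
  xlab = "Metro status",
  ylab = "Number of counties"
)
```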

Figure 1: Looking at Metro Status of Counties

Above in Figure 1 we see

Bivariate Analysis

Main findings so far

Questions that I have created

Next Steps

I plan to make a codebook for my dataset within RStudio, rather than making it in Excel and then importing it into RStudio.
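As a rough sketch of what building the codebook in R could look like, the example below records each variable's name, type, and a description in a data frame. The toy dataset, its values, and the output file name are assumptions; the descriptions paraphrase how the variables are described earlier in this memo.

```r
# Toy dataset with three of the variables discussed in this memo
# (values are placeholders).
counties <- data.frame(
  minimum_wage = c(10, 8),
  region = c("Midwest", "South"),
  metro = factor(c(0, 1), levels = c(0, 1))
)

# One row per variable: name, storage type, and a short description.
codebook <- data.frame(
  variable = names(counties),
  type = unname(vapply(counties, function(x) class(x)[1], character(1))),
  description = c(
    "Minimum wage in the county's state for 2022",
    "Geographical region of the county's state",
    "0 = nonmetropolitan area, 1 = metropolitan area"
  )
)

# Save the codebook next to the data; the path here is hypothetical.
codebook_path <- file.path(tempdir(), "codebook.csv")
write.csv(codebook, codebook_path, row.names = FALSE)

print(codebook)
```

Deriving the `variable` and `type` columns from the data itself means they stay in sync with the dataset as new variables are added, leaving only the descriptions to write by hand.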

Multivariate Analysis

Research To Explore

Footnotes

  1. This information was sourced from Paycom 2023 Guide to Every State’s Minimum Wage.↩︎

  2. This information was sourced from Census Regions and Divisions of the United States.↩︎